6 research outputs found
Node Classification in Uncertain Graphs
In many real applications that use and analyze networked data, the links in
the network graph may be erroneous, or derived from probabilistic techniques.
In such cases, the node classification problem can be challenging, since the
unreliability of the links may affect the final results of the classification
process. If the information about link reliability is not used explicitly, the
classification accuracy in the underlying network may be affected adversely. In
this paper, we focus on situations that require the analysis of the uncertainty
that is present in the graph structure. We study the novel problem of node
classification in uncertain graphs, by treating uncertainty as a first-class
citizen. We propose two techniques based on a Bayes model and automatic
parameter selection, and show that the incorporation of uncertainty in the
classification process as a first-class citizen is beneficial. We
experimentally evaluate the proposed approach using different real data sets,
and study the behavior of the algorithms under different conditions. The
results demonstrate the effectiveness and efficiency of our approach
Modeling and Querying Data Series and Data Streams with Uncertainty
Many real applications consume data that is intrinsically uncertain and error-prone. An uncertain data series is a series whose point values are uncertain. An uncertain data stream is a data stream whose tuples are existentially uncertain and/or have an uncertain value. Typical sources of uncertainty in data series and data streams include sensor data, data synopses, privacy-preserving transformations and forecasting models.
In this thesis, we focus on the following three problems: (1) the formulation and the evaluation of similarity search queries in uncertain data series; (2) the evaluation of nearest neighbor search queries in uncertain data series; (3) the adaptation of sliding windows in uncertain data stream processing to accommodate existential and value uncertainty.
We demonstrate experimentally that the correlation among neighboring time-stamps in data series can be leveraged to increase the accuracy of the results. We further show that the "possible world" semantics can be used as underlying uncertainty model to formulate nearest neighbor queries that can be evaluated efficiently. Finally, we discuss the relation between existential and value uncertainty in data stream applications,
and verify experimentally our proposal of uncertain sliding windows
NADEEF: a commodity data cleaning system
Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows the users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it through writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well known CFDs (FDs), MDs and ETL rules. Treating user implemented interfaces as black-boxes, the core provides algorithms to detect errors and to clean data. The core is designed in a way to allow cleaning algorithms to cope with multiple rules holistically, i.e. detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations for core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system